Abstract: The problem of statistical learning is to construct 
an accurate predictor of a random variable as a function of 
a correlated random variable on the basis of an i.i.d. training 
sample from their joint distribution. Allowable predictors are 
constrained to lie in some specified class, and the goal is to 
approach asymptotically the performance of the best predictor 
in the class. We consider two settings in which the learning 
agent only has access to rate-limited descriptions of the training 
data, and present information-theoretic bounds on the predictor 
performance achievable in the presence of these communication 
constraints. Our proofs do not assume any separation structure 
between compression and learning and rely on a new class 
of operational criteria specifically tailored to joint design of 
encoders and learning algorithms in rate-constrained settings. 



I. Introduction 

Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be jointly distributed random 
variables. The problem of statistical learning is to design an 
accurate predictor of the output variable Y from the input 
variable X on the basis of a number of independent training 
samples drawn from their joint distribution, with very little 
or no prior knowledge of that distribution. The present paper 
focuses on the achievable performance of learning schemes 
when the learning agent only has access to a finite-rate 
description of the training samples. 

This problem of learning under communication constraints 
arises in a variety of contexts, such as distributed estimation 
using a sensor network, adaptive control, or repeated games. 
In these and other scenarios, it is often the case that the agents 
who gather the training data are geographically separated 
from the agents who use these data to make inferences and 
decisions, and communication between these two types of 
agents is possible only over rate-limited channels. Hence, there 
is a trade-off between the communication rate and the quality 
of the inference, and it is of interest to characterize this trade- 
off mathematically. 

This paper follows on our earlier work [1] and presents 
improved bounds on the achievable performance of statistical 
learning schemes operating under two kinds of communication 
constraints: (a) the entire training sequence is delivered to the 
learning agent over a rate-limited noiseless digital channel, and 
(b) the input part of the training sequence is available to the 
learning agent with arbitrary precision, while the output part 
is delivered, as before, over a rate-limited channel. Whereas 
[1] considered schemes where the finite-rate description of 
the training data was obtained through vector quantization, 
effectively imposing a separation structure between compression 
and learning, here we remove this restriction. 

We show that, under certain regularity conditions, there 
is no penalty for compression of the training sequence in 
the setting (a). This is due to the fact that the encoder can 
reliably estimate the underlying distribution (in the metric 
specifically tailored for the learning problem at hand) and then 
communicate the finite-rate description to the learning agent, 
who can then find the optimum predictor for the estimated 
distribution. The setting (b), however, is radically different: 
because the encoder has no access to the input part of the 
training sample, it cannot estimate the underlying distribution. 
Instead, the encoder constructs a finite-rate description of the 
output part using a specific kind of a vector quantizer, namely 
one designed to minimize the expected distance between the 
underlying distribution (whatever it may happen to be) and the 
empirical distribution of the input/quantized output pairs. Our 
achievability result for the setting (b) uses a learning-theoretic 
generalization of recent work by Kramer and Savari [2] on 
rate-constrained communication of probability distributions. 

The problem of learning a pattern classifier under rate 
constraints was also treated in a recent paper by Westover and 
O'Sullivan [3]. They assumed that the underlying probability 
distribution is known, and the rate constraint arises from the 
limitations on the memory of the learning agent; then the 
problem is to design the best possible classifier (without any 
constraints on its structure). The motivation for the work in 
[3] comes from biologically inspired models of learning. The 
approach of the present paper is complementary to that of [3]. 
We consider a more general, decision-theoretic formulation 
of learning that includes regression as well as classification, 
but allow only vague prior knowledge of the underlying 
distribution and assume that the class of available predictors 
is constrained. Thus, while [3] presents information-theoretic 
bounds on the performance of any classifier (including ones 
that are fully cognizant of the generative model for the data), 
here we are concerned with the performance of constrained 
learning schemes that must perform well in the presence of 
uncertainty about the underlying distribution. 

The novel element of our approach is that both the oper- 
ational criteria used to design the encoders and the learning 
algorithm, and the regularity conditions that must hold for 
rate-constrained learning to be possible, involve a tight coupling 
between the available prior knowledge about the underlying 
distribution and the set of predictors available to the learning 
agent. Planned future work includes obtaining converse theo- 
rems (lower bounds) and applying our formalism to specific 
classes of predictors used in statistical learning theory. 

II. Preliminaries and problem formulation 

A very general decision-theoretic formulation of the learning 
problem, due to Haussler [4], goes as follows. We have 
a family $\mathcal{P}$ of probability distributions on 
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and a class $\mathcal{F}$ 
of measurable functions $f : \mathcal{Z} \to \mathbb{R}$. For any 
$P \in \mathcal{P}$, define 

$$ L(f,P) \triangleq \mathbb{E}_P[f(Z)] = \int_{\mathcal{Z}} f(z)\,dP(z), \qquad f \in \mathcal{F}, $$

and 

$$ L^*(\mathcal{F},P) = \inf_{f \in \mathcal{F}} L(f,P), $$

where we assume that the infimum is achieved by some 
$f^* \in \mathcal{F}$. The family $\mathcal{P}$ represents prior 
knowledge about the joint distribution of $X$ and $Y$; each function 
$f \in \mathcal{F}$ corresponds to the loss incurred by a particular 
predictor of $Y$ based on $X$. 
This framework covers, for instance, the following standard 
scenarios: 

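• classification: $\mathcal{X} \subset \mathbb{R}^d$, $\mathcal{Y} = \{1,\ldots,M\}$, and $\mathcal{F}$ 
consists of functions of the form 

$$ f(x,y) = I_{\{g(x) \neq y\}}, \qquad g \in \mathcal{G}, $$

where $I_{\{\cdot\}}$ is the indicator function, and $\mathcal{G}$ is a given 
family of classifiers, i.e., measurable functions 
$g : \mathcal{X} \to \{1,\ldots,M\}$. Any $f^* \in \mathcal{F}$ that achieves 
$L^*(\mathcal{F},P)$ corresponds to some $g^* \in \mathcal{G}$ that has the 
smallest classification error: 
$P(g^*(X) \neq Y) = \inf_{g \in \mathcal{G}} P(g(X) \neq Y)$. 
• regression: $\mathcal{X} \subset \mathbb{R}^d$, $\mathcal{Y} \subset \mathbb{R}$, and $\mathcal{F}$ 
consists of functions of the form 

$$ f(x,y) = (g(x) - y)^2, \qquad g \in \mathcal{G}, $$

where $\mathcal{G}$ is a given family of estimators, i.e., measurable 
functions $g : \mathcal{X} \to \mathbb{R}$. Any $f^* \in \mathcal{F}$ that achieves 
$L^*(\mathcal{F},P)$ corresponds to some $g^* \in \mathcal{G}$ that has the 
smallest mean squared error: 
$\mathbb{E}_P[(g^*(X) - Y)^2] = \inf_{g \in \mathcal{G}} \mathbb{E}_P[(g(X) - Y)^2]$. 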
These are instances of supervised learning problems. Unsupervised 
settings, where $\mathcal{Y} = \emptyset$ (such as density estimation 
or clustering), can also be accommodated by Haussler's framework. 
In this paper we focus only on the supervised case; thus, 
we will assume that $|\mathcal{Y}| \ge 2$. Then the learning problem is to 
construct, for each $n \in \mathbb{N}$, an approximation to $f^*$ on the basis 
of a training sequence $Z^n = \{Z_i\}_{i=1}^n$, where $Z_i = (X_i,Y_i)$ 
are i.i.d. according to some unknown $P \in \mathcal{P}$. 
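
To make the quantities $L(f,P)$ and $L^*(\mathcal{F},P)$ concrete, the following minimal sketch evaluates them for a classification problem on a finite alphabet. The distribution `P`, the class of threshold classifiers `G`, and all function names are hypothetical choices made for illustration only; they do not come from the paper.

```python
# Hypothetical toy problem (all numbers are made up for illustration):
# X = {0, 1, 2}, Y = {0, 1}; P maps each pair z = (x, y) to its probability.
P = {
    (0, 0): 0.30, (0, 1): 0.05,
    (1, 0): 0.10, (1, 1): 0.20,
    (2, 0): 0.05, (2, 1): 0.30,
}

def expected_loss(f, P):
    """L(f, P) = sum over z of f(z) P(z), for P supported on a finite set."""
    return sum(f(x, y) * p for (x, y), p in P.items())

def zero_one_loss(g):
    """Loss f(x, y) = 1{g(x) != y} induced by a classifier g."""
    return lambda x, y: 1.0 if g(x) != y else 0.0

# G: threshold classifiers g_t(x) = 1{x >= t}, t = 0, ..., 3 (an assumed class).
G = {t: (lambda x, t=t: 1 if x >= t else 0) for t in range(4)}
F = {t: zero_one_loss(g) for t, g in G.items()}

def best_in_class(F, P):
    """Return (t*, L*(F, P)) by exhaustive search over the finite class F."""
    t_star = min(F, key=lambda t: expected_loss(F[t], P))
    return t_star, expected_loss(F[t_star], P)

if __name__ == "__main__":
    t_star, L_star = best_in_class(F, P)
    print("best threshold:", t_star, "minimum expected loss:", L_star)
```

In the learning problem proper, `P` is unknown and only the training sequence is available, so the exhaustive search above is a benchmark rather than something a learner can run.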

Formally, a learning scheme (or learner, for short) is a 
sequence $\{f_n\}_{n=1}^\infty$ of maps 
$f_n : \mathcal{Z}^n \times \mathcal{Z} \to \mathbb{R}$, such that 
$f_n(z^n,\cdot) \in \mathcal{F}$ for all $z^n \in \mathcal{Z}^n$. Let 
$Z = (X,Y) \sim P$ be independent of the training sequence $Z^n$. 
The main quantity of interest is the generalization error 

$$ L(f_n,P) = \mathbb{E}\big[f_n(Z^n,Z) \,\big|\, Z^n\big] = \int_{\mathcal{Z}} f_n(Z^n,z)\,dP(z), $$

which is a random variable that depends on the training 
sequence $Z^n$. Under suitable regularity conditions on $\mathcal{P}$ and 
$\mathcal{F}$, one can show that there exist learning schemes that are 
probably approximately correct (PAC), i.e., for every $\epsilon > 0$ 
and $P \in \mathcal{P}$, 

$$ \lim_{n \to \infty} P\big(Z^n : L(f_n,P) > L^*(\mathcal{F},P) + \epsilon\big) = 0 \tag{2.1} $$

(see, e.g., Vidyasagar [5]). A more modest goal is to ensure 
that the excess loss $L(f_n,P) - L^*(\mathcal{F},P)$ is small, either in 
probability or in expectation. 

We are interested in the achievable excess loss in situations 
where there is a rate-constrained channel between the source 
of the training data and the learning agent. Specifically, we 
shall consider the following two scenarios, depicted in Figs. 1 
and 2, respectively. 

Fig. 1. Type I set-up: the encoder has full observation of the training samples. 

In the first set-up, shown in Fig. 1, the learner observes the 
training data through a noiseless digital channel that can 
transmit a fixed finite number of bits per training pair $Z = (X,Y)$. 
A scheme for learning operating at rate $R$ is specified by a 
sequence $\{(e_n,f_n)\}_{n=1}^\infty$, where 
$e_n : \mathcal{Z}^n \to \{1,2,\ldots,M_n\}$ is the encoder and 
$f_n : \{1,2,\ldots,M_n\} \to \mathcal{F}$ is the learner, such that 
$\limsup_{n\to\infty} n^{-1}\log M_n \le R$. For each $n$, the output of 
the learner is a function $f_n(J,\cdot) \in \mathcal{F}$, where $J = e_n(Z^n)$ is 
the finite-rate description of $Z^n$ provided by the encoder. We 
shall refer to this as the Type I set-up. 

Fig. 2. Type II: the encoder sees only the output part of the training sequence. 

In the second set-up, shown in Fig. 2, the learner has 
perfect observation of the input ($\mathcal{X}$-valued) part of the 
training sequence, while the output ($\mathcal{Y}$-valued) part is delivered 
over a rate-limited noiseless digital channel. A scheme for 
learning operating at rate $R$ is a sequence $\{(e_n,f_n)\}_{n=1}^\infty$, 
where $e_n : \mathcal{Y}^n \to \{1,2,\ldots,M_n\}$ is the encoder and 
$f_n : \mathcal{X}^n \times \{1,2,\ldots,M_n\} \to \mathcal{F}$ is the learner, such that 
$\limsup_{n\to\infty} n^{-1}\log M_n \le R$. For each $n$, the output of the 
learner is a function $f_n(J,X^n,\cdot) \in \mathcal{F}$, where $J = e_n(Y^n)$ is 
the finite-rate description of $Y^n$ provided by the encoder. 

We shall often abuse notation and let $f_n$ denote also the 
function in $\mathcal{F}$ returned by the learner. The main object of 
interest is the generalization error 

$$ L(e_n,f_n,P) = \mathbb{E}\big[f_n(W^n,Z) \,\big|\, Z^n\big], \qquad P \in \mathcal{P}, $$

where $Z = (X,Y) \sim P$ is assumed independent of $\{Z_i\}_{i=1}^n$, 
and $W^n$ is equal to $J = e_n(Z^n)$ in a Type I set-up and to 
$(J,X^n)$ in a Type II set-up, where $J = e_n(Y^n)$. We are 
interested in the achievable values of the asymptotic expected 
excess loss. We say that a pair $(R,\Delta)$ is achievable for 
$(\mathcal{F},\mathcal{P})$ if there exists a scheme $\{(e_n,f_n)\}_{n=1}^\infty$ 
operating at rate $R$, such that 

$$ \limsup_{n\to\infty} \mathbb{E} L(e_n,f_n,P) \le L^*(\mathcal{F},P) + \Delta, \qquad \forall P \in \mathcal{P}. $$

III. Achievability theorems 

In this section, we prove two theorems about achievable 
pairs $(R,\Delta)$ in Type I and Type II settings. The key idea 
in both cases is that the encoder needs to provide enough 
information at rate $R$ for the learner to estimate the expected 
value of each $f \in \mathcal{F}$ to within $\Delta$. 

A. Notation, preliminaries and assumptions 

We assume that the space $\mathcal{Z}$ is equipped with an appropriate 
$\sigma$-algebra $\mathcal{A}$. Typical cases of interest in learning theory are 
$\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y}$ finite (classification) or 
$\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ 
(regression), with the usual Borel $\sigma$-algebras. The space of all 
probability measures on $(\mathcal{Z},\mathcal{A})$ will be denoted by 
$\mathcal{M}(\mathcal{Z})$. $\mathcal{F}$ is a class of measurable functions 
from $(\mathcal{Z},\mathcal{A})$ into $[0,B]$ for some $0 < B < +\infty$; to avoid 
various measurability issues, we also assume throughout that $\mathcal{F}$ is 
countable. We shall identify signed measures $\mu$ on $(\mathcal{Z},\mathcal{A})$ 
with real-valued linear functionals $f \mapsto \mu(f)$ on $\mathcal{F}$, where 
$\mu(f) = \int_{\mathcal{Z}} f\,d\mu$. Thus, to each $\mu$ we can associate the 
$\ell^\infty(\mathcal{F})$-norm 

$$ \|\mu\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\mu(f)|. $$

For an $n$-tuple $z^n \in \mathcal{Z}^n$, $P_{z^n}$ will denote the corresponding 
empirical measure: $P_{z^n} = n^{-1}\sum_{i=1}^n \delta_{z_i}$, where $\delta_z$ is the 
Dirac measure (point mass) concentrated at $z \in \mathcal{Z}$. We assume 
that $\mathcal{F}$ is a Glivenko-Cantelli (GC) class [6], i.e., 

$$ \lim_{n\to\infty} \|P_{Z^n} - P\|_{\mathcal{F}} = 0, \quad \text{a.s.} \tag{3.2} $$

for every $P \in \mathcal{M}(\mathcal{Z})$. In other words, the class $\mathcal{F}$ is 
such that, for each $P \in \mathcal{M}(\mathcal{Z})$, the sample averages 
$P_{Z^n}(f)$ converge to the theoretical averages $P(f)$ uniformly over 
$\mathcal{F}$. This is a standard assumption in statistical learning theory [5], [6]. 
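
As a rough numerical illustration of the empirical measure and the $\|\cdot\|_{\mathcal{F}}$-norm, the sketch below computes $\|P_{z^n} - P\|_{\mathcal{F}}$ for a finite alphabet and a finite class. The uniform `P`, the class of threshold indicators `F`, and the helper names are assumptions introduced here, not part of the paper.

```python
import random

def empirical_measure(sample):
    """P_{z^n}: puts mass 1/n on each sample point (with multiplicity)."""
    n = len(sample)
    mu = {}
    for z in sample:
        mu[z] = mu.get(z, 0.0) + 1.0 / n
    return mu

def integrate(f, mu):
    """mu(f) = sum over z of f(z) mu(z) for a measure mu on a finite set."""
    return sum(f(z) * m for z, m in mu.items())

def f_norm(mu, nu, F):
    """||mu - nu||_F = sup over f in F of |mu(f) - nu(f)|, for a finite class F."""
    return max(abs(integrate(f, mu) - integrate(f, nu)) for f in F)

# Hypothetical example: Z = {0, ..., 4}, P uniform, F = indicators of {z <= t}.
support = list(range(5))
P = {z: 0.2 for z in support}
F = [lambda z, t=t: 1.0 if z <= t else 0.0 for t in support]  # values in [0, 1]

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000, 10000):
        sample = [rng.choice(support) for _ in range(n)]
        # For a GC class this distance should shrink as n grows.
        print(n, f_norm(empirical_measure(sample), P, F))
```

A finite class on a finite alphabet is trivially GC, so this only illustrates the definitions; it says nothing about infinite classes.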

B. Type I schemes 

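We now show that, in a Type I set-up, there is no penalty 
for compression of the training sequence, provided the family 
$\mathcal{P}$ is not too "rich." Our notion of richness will pertain to the 
geometry of $\mathcal{P}$ w.r.t. the $\|\cdot\|_{\mathcal{F}}$-norm. Given some 
$\epsilon > 0$, we say that a finite set 
$\{P_1,\ldots,P_M\} \subset \mathcal{P}$ is an $\epsilon$-net for $\mathcal{P}$ if 

$$ \sup_{P \in \mathcal{P}} \min_{1 \le m \le M} \|P - P_m\|_{\mathcal{F}} \le \epsilon. $$

We define the covering number $N_{\mathcal{F}}(\epsilon,\mathcal{P})$ as the cardinality of 
the minimal $\epsilon$-net of $\mathcal{P}$, and the Kolmogorov $\epsilon$-entropy of 
$\mathcal{P}$ as $H_{\mathcal{F}}(\epsilon,\mathcal{P}) = \log N_{\mathcal{F}}(\epsilon,\mathcal{P})$ [7]. 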

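Theorem 3.1. Suppose that there exists a monotone decreasing 
sequence $\{\epsilon_n\}_{n=1}^\infty$ of nonnegative reals with 
$\epsilon_n \to 0$, such that 

$$ H_{\mathcal{F}}(\epsilon_n,\mathcal{P}) = o(n). \tag{3.3} $$

Then the pair $(0,0)$ is achievable for $(\mathcal{F},\mathcal{P})$. 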

Proof: For each $n$, let $\mathcal{N}_n = \{P_1,P_2,\ldots,P_{M_n}\}$ be the 
minimal $\epsilon_n$-net for $\mathcal{P}$ w.r.t. $\|\cdot\|_{\mathcal{F}}$, where 
$M_n = N_{\mathcal{F}}(\epsilon_n,\mathcal{P})$. Consider the following scheme: 

• encoder: $e_n(Z^n) = \arg\min_{1 \le m \le M_n} \|P_{Z^n} - P_m\|_{\mathcal{F}}$ 

• learner: $f_n(J,\cdot) = \arg\min_{f \in \mathcal{F}} P_J(f)$ 

In other words, the encoder finds the element of $\mathcal{N}_n$ closest to 
the empirical distribution $P_{Z^n}$ in the $\|\cdot\|_{\mathcal{F}}$-norm and transmits 
its index to the learner. The learner then finds the function in 
$\mathcal{F}$ that minimizes the expected loss assuming that the true 
distribution is the one estimated by the encoder. 

It is easy to see that the resulting scheme operates at zero 
rate. Indeed, from (3.3), 

$$ \lim_{n\to\infty} \frac{\log M_n}{n} = \lim_{n\to\infty} \frac{H_{\mathcal{F}}(\epsilon_n,\mathcal{P})}{n} = 0. $$

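To bound the expected loss, assume that $P \in \mathcal{P}$ is the true 
distribution and let $P_{m^*} \in \mathcal{N}_n$ be the element of the 
$\epsilon_n$-net that is closest to $P$, i.e., 

$$ \|P - P_{m^*}\|_{\mathcal{F}} = \min_{1 \le m \le M_n} \|P - P_m\|_{\mathcal{F}} \le \epsilon_n. $$

Let $J = e_n(Z^n)$. We then have 

$$
\begin{aligned}
L(e_n,f_n,P) &= P(f_n) \\
&\le \|P - P_J\|_{\mathcal{F}} + P_J(f_n) \\
&= \|P - P_J\|_{\mathcal{F}} + L^*(\mathcal{F},P_J) \\
&\overset{(a)}{\le} 2\|P - P_J\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&\le 2\|P - P_{Z^n}\|_{\mathcal{F}} + 2\|P_{Z^n} - P_J\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&\overset{(b)}{\le} 2\|P - P_{Z^n}\|_{\mathcal{F}} + 2\|P_{Z^n} - P_{m^*}\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&\le 4\|P - P_{Z^n}\|_{\mathcal{F}} + 2\|P - P_{m^*}\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&\le 4\|P - P_{Z^n}\|_{\mathcal{F}} + 2\epsilon_n + L^*(\mathcal{F},P),
\end{aligned}
$$

where (a) follows from the fact that 

$$ |L^*(\mathcal{F},P) - L^*(\mathcal{F},P')| \le \|P - P'\|_{\mathcal{F}} $$

for any two $P, P' \in \mathcal{P}$, and (b) is by construction of the 
encoder. The remaining steps are consequences of various 
definitions and the triangle inequality. Taking expectations and 
the limit as $n \to \infty$, we get 

$$ \lim_{n\to\infty} \mathbb{E} L(e_n,f_n,P) \le 4 \lim_{n\to\infty} \mathbb{E}\|P_{Z^n} - P\|_{\mathcal{F}} + 2 \lim_{n\to\infty} \epsilon_n + L^*(\mathcal{F},P). $$

The first limit on the right-hand side of this inequality is zero 
by the GC property (the a.s. convergence in (3.2) carries over to 
expectations by dominated convergence, since 
$\|P_{Z^n} - P\|_{\mathcal{F}} \le B$), while the second one is zero since 
$\epsilon_n \to 0$. Thus, $\lim_{n\to\infty} \mathbb{E} L(e_n,f_n,P) \le L^*(\mathcal{F},P)$. ■ 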
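
The scheme used in the proof (nearest net element at the encoder, plug-in minimization at the learner) can be sketched on a toy family as follows. This is only an illustration under stated assumptions: the one-parameter family `make_P`, the four-point stand-in for an $\epsilon_n$-net, the two-element class `F`, and all function names are hypothetical and not taken from the paper.

```python
import math
import random

def make_P(theta):
    """Hypothetical family: X uniform on {0, 1}, Y = X flipped with prob. theta."""
    return {(x, y): 0.5 * (theta if y != x else 1.0 - theta)
            for x in (0, 1) for y in (0, 1)}

# F: 0-1 losses of the classifiers g(x) = x and g(x) = 1 - x (values in [0, 1]).
F = [lambda z: 1.0 if z[1] != z[0] else 0.0,
     lambda z: 1.0 if z[1] == z[0] else 0.0]

def integrate(f, mu):
    return sum(f(z) * m for z, m in mu.items())

def f_norm(mu, nu):
    return max(abs(integrate(f, mu) - integrate(f, nu)) for f in F)

def empirical_measure(sample):
    n = len(sample)
    mu = {}
    for z in sample:
        mu[z] = mu.get(z, 0.0) + 1.0 / n
    return mu

def encoder(sample, net):
    """e_n: index of the net element closest to P_{Z^n} in the F-norm."""
    P_emp = empirical_measure(sample)
    return min(range(len(net)), key=lambda m: f_norm(P_emp, net[m]))

def learner(j, net):
    """f_n(J, .): index of the loss in F with the smallest expectation under P_J."""
    return min(range(len(F)), key=lambda k: integrate(F[k], net[j]))

def draw(theta, n, rng):
    """Draw n i.i.d. pairs from make_P(theta)."""
    sample = []
    for _ in range(n):
        x = rng.randint(0, 1)
        y = 1 - x if rng.random() < theta else x
        sample.append((x, y))
    return sample

if __name__ == "__main__":
    net = [make_P(t) for t in (0.1, 0.4, 0.6, 0.9)]  # stand-in for an eps_n-net
    rng = random.Random(1)
    n = 2000
    j = encoder(draw(0.15, n, rng), net)
    k = learner(j, net)
    print("index sent:", j, "predictor chosen:", k,
          "rate (bits/sample):", math.log2(len(net)) / n)
```

With a fixed net, as here, the rate $n^{-1}\log M_n$ decays like $1/n$; in the theorem the net is refined with $n$, and condition (3.3) is what keeps the rate vanishing.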
We can give one particular example when condition (3.3) 
will hold. Given any two probability measures $P, Q$ on 
$(\mathcal{Z},\mathcal{A})$, define the variational distance between them as 

$$ \|P - Q\|_V \triangleq \sup \sum_i |P(A_i) - Q(A_i)|, $$

where the supremum is over all finite $\mathcal{A}$-measurable partitions 
$\{A_i\}$ of $\mathcal{Z}$. Then we can define the covering numbers 
$N_V(\epsilon,\mathcal{P})$ and the Kolmogorov $\epsilon$-entropy 
$H_V(\epsilon,\mathcal{P})$. Now suppose that there exist some constants 
$C > 0$ and $\alpha > 0$, such that 
$H_V(\epsilon,\mathcal{P}) \le C(1/\epsilon)^\alpha$ for small enough $\epsilon$. 
This will be the case, for instance, when $\mathcal{Z}$ is a compact subset of 
a Euclidean space and all $P \in \mathcal{P}$ have Lipschitz-continuous 
densities w.r.t. some dominating measure $\nu$, with the Lipschitz 
constants all bounded by some $L < +\infty$ [7]. Then, since 
$\|P - P'\|_{\mathcal{F}} \le B\|P - P'\|_V$ for all $P, P' \in \mathcal{P}$, we will 
have $H_{\mathcal{F}}(\epsilon,\mathcal{P}) \le C'(1/\epsilon)^\alpha$ with 
$C' = C'(C,B,\alpha)$. Then, choosing $\epsilon_n = 1/\log n$, we will 
have $H_{\mathcal{F}}(\epsilon_n,\mathcal{P}) \le C'(\log n)^\alpha = o(n)$. 
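
A quick numerical check of the last claim: with $\epsilon_n = 1/\log n$, the entropy bound $C'(\log n)^\alpha$ divided by $n$ should tend to zero. The values `C_prime = 1.0` and `alpha = 2.0` below are arbitrary placeholders, not constants from the paper.

```python
import math

def entropy_bound(n, C_prime=1.0, alpha=2.0):
    """Bound C' (1/eps_n)^alpha on H_F(eps_n, P) for the choice eps_n = 1/log n."""
    eps_n = 1.0 / math.log(n)
    return C_prime * (1.0 / eps_n) ** alpha

if __name__ == "__main__":
    for n in (10, 10**3, 10**6, 10**9):
        # Per-sample rate overhead of describing the net index; should shrink.
        print(n, entropy_bound(n) / n)
```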

C. Type II schemes 

The case of Type II schemes is radically different. Whereas 
in a Type I scheme the encoder can use the training data to 
estimate the underlying distribution and then communicate its 
finite-rate description to the learner, in a Type II situation the 
encoder can only estimate the $Y$-marginal. Unless the 
distributions in $\mathcal{P}$ can be reliably identified from their $Y$-marginals 
(which is a very restrictive condition), the encoder does not 
have enough "learning" ability to estimate the underlying 
distribution. Instead, we will take the following approach. 

Given $\Delta > 0$, let us suppose that, for each $n$, the encoder 
can implement a mapping $Y^n \mapsto \widehat{Y}^n$, such that, whenever 
the training data are drawn from some $P \in \mathcal{P}$ (unknown to 
both the encoder and the learner), the empirical distribution 
$P_{(X^n,\widehat{Y}^n)}$ is, on average, at most $\Delta/4$ away from $P$ in the 
$\|\cdot\|_{\mathcal{F}}$ sense, and that 
$n^{-1}\log|\widehat{Y}^n(\mathcal{Y}^n)| \le R$. Then the encoder 
communicates a binary description $J$ of $\widehat{Y}^n$ at rate $\le R$ to the 
learning agent, who decodes it to get $\widehat{Y}^n$ and then implements 
the following two-step procedure: 

$$ \widehat{P} = \arg\min_{P' \in \mathcal{P}} \|P_{(X^n,\widehat{Y}^n)} - P'\|_{\mathcal{F}}, \qquad f_n = \arg\min_{f \in \mathcal{F}} \widehat{P}(f). $$

Then essentially the same technique as in the proof of 
Theorem 3.1 will give us $\mathbb{E} L(e_n,f_n,P) \le L^*(\mathcal{F},P) + \Delta$ for every 
$P \in \mathcal{P}$, thus establishing the existence of a scheme operating 
at rate $R$ and achieving an excess loss of $\le \Delta$ on each $P \in \mathcal{P}$. 

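These considerations motivate the definition of the following 
$n$th-order operational distortion-rate function: 

$$ \mathbb{D}_n(\mathcal{P},\mathcal{F},R) \triangleq \inf_{\widehat{Y}^n} \sup_{P \in \mathcal{P}} \mathbb{E}_P \|P_{(X^n,\widehat{Y}^n)} - P\|_{\mathcal{F}}, \tag{3.4} $$

where the infimum is over all $\widehat{Y}^n : \mathcal{Y}^n \to \mathcal{Y}^n$, such that 
$n^{-1}\log|\{\widehat{Y}^n(y^n) : y^n \in \mathcal{Y}^n\}| \le R$. We also define the 
limiting operational distortion-rate function 

$$ \mathbb{D}(\mathcal{P},\mathcal{F},R) \triangleq \lim_{n\to\infty} \mathbb{D}_n(\mathcal{P},\mathcal{F},R). $$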
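
To illustrate the quantity inside (3.4), the sketch below estimates $\mathbb{E}_P\|P_{(X^n,\widehat{Y}^n)} - P\|_{\mathcal{F}}$ by Monte Carlo for one fixed distribution and one fixed quantizer. Everything here is an assumption made for illustration: the toy source `P`, the scaled squared-error class `F`, and the componentwise quantizers `identity` and `coarse`. In particular, a componentwise quantizer is just one simple map from $\mathcal{Y}^n$ to $\mathcal{Y}^n$; the infimum in (3.4) ranges over all block maps satisfying the rate constraint, and the supremum over $\mathcal{P}$ is not taken here.

```python
import random

def empirical_measure(pairs):
    n = len(pairs)
    mu = {}
    for z in pairs:
        mu[z] = mu.get(z, 0.0) + 1.0 / n
    return mu

def f_norm(mu, nu, F):
    """sup over f in F of |mu(f) - nu(f)| for measures on a finite set."""
    support = set(mu) | set(nu)
    return max(abs(sum(f(z) * (mu.get(z, 0.0) - nu.get(z, 0.0)) for z in support))
               for f in F)

def distortion(xs, ys, quantizer, P, F):
    """||P_{(x^n, yhat^n)} - P||_F for one training sequence and one quantizer."""
    y_hat = quantizer(ys)  # a map from Y^n to Y^n
    return f_norm(empirical_measure(list(zip(xs, y_hat))), P, F)

def expected_distortion(P, quantizer, F, n, trials, rng):
    """Monte Carlo estimate of E_P ||P_{(X^n, Yhat^n)} - P||_F."""
    points = list(P)
    weights = [P[z] for z in points]
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(points, weights=weights, k=n)
        xs = [x for x, _ in sample]
        ys = [y for _, y in sample]
        total += distortion(xs, ys, quantizer, P, F)
    return total / trials

# Hypothetical toy source: X in {0, 1}, Y in {0, 1, 2, 3}.
P = {(0, 0): 0.25, (0, 1): 0.25, (1, 2): 0.25, (1, 3): 0.25}

def squared_loss(g):
    """Scaled squared error f(x, y) = ((g[x] - y) / 3)^2, with values in [0, 1]."""
    return lambda z: ((g[z[0]] - z[1]) / 3.0) ** 2

F = [squared_loss(g) for g in ((0, 2), (1, 3), (3, 0), (0, 3))]

identity = lambda ys: list(ys)                  # 2 bits per output symbol
coarse = lambda ys: [2 * (y // 2) for y in ys]  # 1 bit per output symbol

if __name__ == "__main__":
    rng = random.Random(0)
    for name, q in (("identity", identity), ("coarse", coarse)):
        print(name, expected_distortion(P, q, F, n=200, trials=200, rng=rng))
```

In this toy example the coarse quantizer leaves a residual distortion that does not vanish with $n$, which is the kind of rate-distortion trade-off that $\mathbb{D}(\mathcal{P},\mathcal{F},R)$ is meant to capture.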

We now state the achievability result for Type II schemes 
in terms of these operational quantities: 

Theorem 3.2. Given any $R > 0$, the pair 
$(R, 4\mathbb{D}(\mathcal{P},\mathcal{F},R))$ is achievable. 

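Proof: For each $n$, let $\widehat{Y}^n : \mathcal{Y}^n \to \mathcal{Y}^n$ be the encoder 
that achieves the infimum in (3.4). Let 
$\{\hat{y}^n(1),\ldots,\hat{y}^n(M_n)\}$ be some arbitrary enumeration of its 
codewords. Then we construct the following scheme: 

• encoder: $e_n(Y^n) = J$, such that $\widehat{Y}^n(Y^n) = \hat{y}^n(J)$. 

• learner: $f_n(J,X^n,\cdot) = \arg\min_{f \in \mathcal{F}} \widehat{P}(f)$, where 
$\widehat{P} = \arg\min_{P' \in \mathcal{P}} \|P_{(X^n,\hat{y}^n(J))} - P'\|_{\mathcal{F}}$. 

The scheme $\{(e_n,f_n)\}_{n=1}^\infty$ operates at rate $R$ owing to the 
fact that $n^{-1}\log M_n \le R$. As for the excess loss, we have 

$$
\begin{aligned}
L(e_n,f_n,P) &= P(f_n) \\
&\le 2\|P - \widehat{P}\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&\le 2\|P - P_{(X^n,\hat{y}^n(J))}\|_{\mathcal{F}} + 2\|P_{(X^n,\hat{y}^n(J))} - \widehat{P}\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&\le 4\|P - P_{(X^n,\hat{y}^n(J))}\|_{\mathcal{F}} + L^*(\mathcal{F},P) \\
&= 4\|P - P_{(X^n,\widehat{Y}^n(Y^n))}\|_{\mathcal{F}} + L^*(\mathcal{F},P).
\end{aligned}
$$

Taking expectations, using the fact that each $\widehat{Y}^n$ achieves the 
$n$th-order optimum $\mathbb{D}_n(\mathcal{P},\mathcal{F},R)$, and then taking the limit as 
$n \to \infty$, we get 

$$ \limsup_{n\to\infty} \mathbb{E} L(e_n,f_n,P) \le L^*(\mathcal{F},P) + 4\mathbb{D}(\mathcal{P},\mathcal{F},R), \qquad \forall P \in \mathcal{P}, $$

which proves the theorem. ■ 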
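We would like to express $\mathbb{D}(\mathcal{P},\mathcal{F},R)$ purely in terms of 
information-theoretic quantities. It is relatively straightforward 
to derive an information-theoretic lower bound on 
$\mathbb{D}(\mathcal{P},\mathcal{F},R)$. To that end, we will draw upon recent work of 
Kramer and Savari [2] on rate-constrained communication of probability 
distributions. The following properties of $\|\cdot\|_{\mathcal{F}}$ are immediate: 

1) $\|P - Q\|_{\mathcal{F}} \le 2B$ for all $P, Q \in \mathcal{M}(\mathcal{Z})$. 

2) For a fixed $P$, the mapping $Q \mapsto \|Q - P\|_{\mathcal{F}}$ is Lipschitz 
in the variational norm $\|\cdot\|_V$: for all $Q, Q' \in \mathcal{M}(\mathcal{Z})$, 

$$ \big| \|P - Q\|_{\mathcal{F}} - \|P - Q'\|_{\mathcal{F}} \big| \le B\|Q - Q'\|_V. $$

3) The mapping $Q \mapsto \|Q - P\|_{\mathcal{F}}$ is convex: for any 
$Q = \lambda Q_1 + (1 - \lambda)Q_2$ with some $\lambda \in [0,1]$ and 
$Q_1, Q_2 \in \mathcal{M}(\mathcal{Z})$, 

$$ \|Q - P\|_{\mathcal{F}} \le \lambda\|Q_1 - P\|_{\mathcal{F}} + (1 - \lambda)\|Q_2 - P\|_{\mathcal{F}}. $$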
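
The three properties can be spot-checked numerically on a finite alphabet. The sketch below draws random distributions and random loss vectors with values in $[0,B]$ and verifies each inequality up to floating-point tolerance; the value `B = 2.0`, the alphabet size, and the function names are arbitrary assumptions, and a passing check is of course not a proof.

```python
import random

B = 2.0  # assumed upper bound on the loss functions

def random_distribution(k, rng):
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def f_norm(mu, nu, F):
    """sup over f in F of |mu(f) - nu(f)|; measures and losses are length-k lists."""
    return max(abs(sum(f[z] * (mu[z] - nu[z]) for z in range(len(mu)))) for f in F)

def variational(mu, nu):
    """On a finite set the partition into singletons attains the supremum."""
    return sum(abs(m - v) for m, v in zip(mu, nu))

def check_properties(trials=1000, k=6, seed=0, tol=1e-12):
    rng = random.Random(seed)
    for _ in range(trials):
        F = [[B * rng.random() for _ in range(k)] for _ in range(5)]
        P, Q, Q2 = (random_distribution(k, rng) for _ in range(3))
        lam = rng.random()
        mix = [lam * a + (1 - lam) * b for a, b in zip(Q, Q2)]
        bounded = f_norm(P, Q, F) <= 2 * B + tol
        lipschitz = (abs(f_norm(P, Q, F) - f_norm(P, Q2, F))
                     <= B * variational(Q, Q2) + tol)
        convex = (f_norm(mix, P, F)
                  <= lam * f_norm(Q, P, F) + (1 - lam) * f_norm(Q2, P, F) + tol)
        if not (bounded and lipschitz and convex):
            return False
    return True

if __name__ == "__main__":
    print("all three properties hold on random instances:", check_properties())
```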

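Then for each $P \in \mathcal{P}$ the mapping 
$Q \in \mathcal{M}(\mathcal{Z}) \mapsto \|Q - P\|_{\mathcal{F}}$ 
satisfies the requirements listed in Section III of [2]. Thus, 
following Kramer and Savari, we can define, for every $P \in \mathcal{P}$ 
and every $R > 0$, the distortion-rate function 

$$ D_{\rm KS}(P,\mathcal{F},R) \triangleq \inf \|P_{X,U} - P\|_{\mathcal{F}}, \tag{3.5} $$

where the infimum is over all distributions of the triple 
$(X,Y,U) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$, such that 
$P_{X,Y} = P$, $X \to Y \to U$, and $I(Y;U) \le R$. Kramer and Savari deal 
only with the case when $\mathcal{X}$ and $\mathcal{Y}$ are both finite. However, 
it can be shown that (3.5) is equal to $\mathbb{D}(\mathcal{P},\mathcal{F},R)$ for 
general $\mathcal{X}$, $\mathcal{Y}$ as well when $\mathcal{P}$ is a singleton, 
$\mathcal{P} = \{P\}$. The proof of this fact (omitted for lack of space) 
relies on the GC property (3.2) and on a straightforward extension of the 
"piggyback coding" technique of Wyner [8, Lemma 4.3] to general 
(non-finite) alphabets. 

$$ \mathbb{D}(\mathcal{P},\mathcal{F},R) \le \sup_{\alpha > 0}\, \inf_{\delta > 0}\, \sup_{P' \in \mathcal{M}(\mathcal{Y})}\, \inf_{\substack{Q_{U|Y}: \\ I(P' \times Q_{U|Y}) \le R + \alpha}}\, \sup_{\substack{P \in \mathcal{P}: \\ \|P_Y - P'\|_V \le \delta}} \mathbb{E}_{P \times Q_{U|Y}} \|\delta_{(X,U)} - P\|_{\mathcal{F}}. \tag{3.7} $$

Moreover, when $|\mathcal{P}| \ge 2$, we have the following lower bound: 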

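Theorem 3.3. $\mathbb{D}(\mathcal{P},\mathcal{F},R) \ge \sup_{P \in \mathcal{P}} D_{\rm KS}(P,\mathcal{F},R)$. 

Proof: Fix any code $\widehat{Y}^n(\cdot)$ of rate $R$ that achieves 
$\mathbb{D}_n(\mathcal{P},\mathcal{F},R)$: 

$$ \sup_{P \in \mathcal{P}} \mathbb{E}_P \|P_{(X^n,\widehat{Y}^n)} - P\|_{\mathcal{F}} = \mathbb{D}_n(\mathcal{P},\mathcal{F},R). $$

Fix some $P \in \mathcal{P}$ and let $P_{X_i,Y_i,\widehat{Y}_i}$ denote the joint 
distribution of $(X_i,Y_i,\widehat{Y}_i)$ when $(X_1,Y_1),\ldots,(X_n,Y_n)$ are 
i.i.d. according to $P$, and $\widehat{Y}_i$ denotes the $i$th component of 
$\widehat{Y}^n(Y^n)$. Also, define the random variables $X \in \mathcal{X}$, 
$Y \in \mathcal{Y}$, and $U \in \mathcal{Y}$ with the joint distribution 

$$ P_{X,Y,U} = \frac{1}{n} \sum_{i=1}^n P_{X_i,Y_i,\widehat{Y}_i}. $$

Then $P_{X,Y} = P$ and $X \to Y \to U$. Using convexity and 
the fact that $\mathbb{E}\sup_{f \in \mathcal{F}}[\cdot] \ge \sup_{f \in \mathcal{F}} \mathbb{E}[\cdot]$, we get 

$$ \mathbb{E}_P \|P_{(X^n,\widehat{Y}^n)} - P\|_{\mathcal{F}} \ge \|P_{X,U} - P\|_{\mathcal{F}}. $$

That is, $\|P_{X,U} - P\|_{\mathcal{F}} \le \mathbb{D}_n(\mathcal{P},\mathcal{F},R)$ for all 
$P \in \mathcal{P}$. Moreover, steps similar to those in [2, Thm. 1] give 

$$ nR \ge H(\widehat{Y}^n) \ge I(Y^n;\widehat{Y}^n) \ge \sum_{i=1}^n I(Y_i;\widehat{Y}_i) \ge nI(Y;U). $$

Thus, we have found a triple of random variables 
$(X,Y,U) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$, such that: 
(i) $P_{X,Y} = P$, (ii) $X \to Y \to U$, 
(iii) $\|P_{X,U} - P\|_{\mathcal{F}} \le \mathbb{D}_n(\mathcal{P},\mathcal{F},R)$, 
(iv) $I(Y;U) \le R$. Hence, for every $P \in \mathcal{P}$, 
$\mathbb{D}_n(\mathcal{P},\mathcal{F},R) \ge D_{\rm KS}(P,\mathcal{F},R)$. Taking the 
supremum over all $P \in \mathcal{P}$ and then the limit as $n \to \infty$, we 
get the desired result. ■ 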
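
For readers who want the convexity step of the proof spelled out, one way to write it is given below. It assumes our reading of the proof's definition, namely that $P_{X,Y,U}$ is the average $\frac{1}{n}\sum_{i=1}^n P_{X_i,Y_i,\widehat{Y}_i}$, so that $\frac{1}{n}\sum_{i=1}^n \mathbb{E}_P f(X_i,\widehat{Y}_i) = P_{X,U}(f)$ for every $f \in \mathcal{F}$.

```latex
\begin{align*}
\mathbb{E}_P \big\| P_{(X^n,\widehat{Y}^n)} - P \big\|_{\mathcal{F}}
  &= \mathbb{E}_P \sup_{f \in \mathcal{F}}
     \Big| \frac{1}{n}\sum_{i=1}^n f(X_i,\widehat{Y}_i) - P(f) \Big| \\
  &\ge \sup_{f \in \mathcal{F}}
     \Big| \frac{1}{n}\sum_{i=1}^n \mathbb{E}_P f(X_i,\widehat{Y}_i) - P(f) \Big| \\
  &= \sup_{f \in \mathcal{F}} \big| P_{X,U}(f) - P(f) \big|
   = \big\| P_{X,U} - P \big\|_{\mathcal{F}}.
\end{align*}
```

The inequality combines $\mathbb{E}\sup[\cdot] \ge \sup\mathbb{E}[\cdot]$ with Jensen's inequality $\mathbb{E}|\cdot| \ge |\mathbb{E}[\cdot]|$.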
However, it is not straightforward to derive an 
information-theoretic upper bound on $\mathbb{D}(\mathcal{P},\mathcal{F},R)$. This would 
require constructing a rate-$R$ code that asymptotically achieves 
$\mathbb{D}(\mathcal{P},\mathcal{F},R)$. In order to prove achievability, one could take 
a rate-$R$ code for each "representative" distribution in $\mathcal{P}$ 
(assuming $\mathcal{P}$ is not too rich, so that it can be represented 
by a slowly, e.g., subexponentially, growing number of 
distributions), combine the codes into a union code (which will 
result in an asymptotically negligible rate overhead), and then 
devise a rule for mapping the sequence $Y^n$ into one of the 
codewords. However, the difficulty here is that the encoder can 
only estimate the $Y$-marginal of the underlying distribution 
and cannot select the right code based on this information 
alone. One (suboptimal) strategy is to bound the distortion 
$\|P_{(X^n,\widehat{Y}^n)} - P\|_{\mathcal{F}}$ by the average of single-letter 
functions of the form 
$\rho_{\mathcal{F},P}(X_i,\widehat{Y}_i) = \|\delta_{(X_i,\widehat{Y}_i)} - P\|_{\mathcal{F}}$, 
where $\delta_{(X_i,\widehat{Y}_i)}$ is the Dirac measure concentrated on 
$(X_i,\widehat{Y}_i)$, and consider the new problem of finding 

$$ \inf_{\widehat{Y}^n} \sup_{P \in \mathcal{P}} \mathbb{E}_P \left[ \frac{1}{n} \sum_{i=1}^n \rho_{\mathcal{F},P}(X_i,\widehat{Y}_i) \right], \tag{3.6} $$

where the infimum is over all rate-$R$ codes 
$\widehat{Y}^n : \mathcal{Y}^n \to \mathcal{Y}^n$. Then (3.6) will be an upper bound on 
$\mathbb{D}_n(\mathcal{P},\mathcal{F},R)$. Note that the problem of minimizing (3.6) is 
an instance of minimax noisy source coding [9]: given a sequence of i.i.d. 
samples $(X_1,Y_1),(X_2,Y_2),\ldots$ from an unknown $P \in \mathcal{P}$ and 
a blocklength $n$, we wish to code $Y^n$ using a rate-$R$ code, 
such that the sequence $X^n$ is reconstructed from the encoded 
data with small average $\rho_{\mathcal{F},P}(\cdot,\cdot)$ distortion. When 
$\mathcal{Y}$ is finite, a type-covering argument, as in [9], can be used to show 
the bound (3.7), displayed earlier in this section (details are omitted for 
lack of space). Given any $\alpha > 0$, $\delta > 0$, and 
$P' \in \mathcal{M}(\mathcal{Y})$, the second infimum in (3.7) is over all 
conditional probability distributions (transition kernels) $Q_{U|Y}$ from 
$\mathcal{Y}$ to $\mathcal{Y}$, such that the mutual information between $Y$ and 
$U$ when $Y \sim P'$ and $U|Y \sim Q_{U|Y}$ is at most $R + \alpha$. The inner 
supremum is over all probability distributions $P \in \mathcal{P}$, such that 
their $Y$-marginal $P_Y$ is within $\delta$ from $P'$ in the variational norm 
$\|\cdot\|_V$; $P \times Q_{U|Y}$ denotes the joint distribution of $X$, $Y$ and 
$U$ when $(X,Y) \sim P$ and $U|Y \sim Q_{U|Y}$; and $\delta_{(X,U)}$ denotes 
the Dirac measure concentrated at $(X,U) \in \mathcal{X} \times \mathcal{Y}$. We leave 
the problem of tightening (3.7) for future work. Evidently, the 
difficulties involved in extending this technique to general $\mathcal{Y}$ 
are of the same nature as in [9] and have to do with finding 
the right topology on $\mathcal{M}(\mathcal{Y})$ that would give the same uniform 
error bounds as for the variational distance in the finite case. 
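
The single-letter relaxation behind (3.6) rests on the triangle inequality $\|P_{z^n} - P\|_{\mathcal{F}} \le \frac{1}{n}\sum_{i=1}^n \|\delta_{z_i} - P\|_{\mathcal{F}}$. The sketch below compares the two sides on a toy example, which also hints at why the strategy is called suboptimal: the right-hand side need not shrink with $n$. The alphabet, the uniform `P`, the threshold class `F`, and the function names are illustrative assumptions; the sample points stand in for the pairs $(X_i,\widehat{Y}_i)$.

```python
import random

def empirical_measure(sample):
    n = len(sample)
    mu = {}
    for z in sample:
        mu[z] = mu.get(z, 0.0) + 1.0 / n
    return mu

def f_norm(mu, nu, F):
    """sup over f in F of |mu(f) - nu(f)| for measures on a finite set."""
    support = set(mu) | set(nu)
    return max(abs(sum(f(z) * (mu.get(z, 0.0) - nu.get(z, 0.0)) for z in support))
               for f in F)

def block_distortion(sample, P, F):
    """||P_{z^n} - P||_F, the distortion of the whole block."""
    return f_norm(empirical_measure(sample), P, F)

def single_letter_average(sample, P, F):
    """(1/n) sum_i ||delta_{z_i} - P||_F, the single-letter upper bound."""
    return sum(f_norm({z: 1.0}, P, F) for z in sample) / len(sample)

# Hypothetical example: points 0..4, P uniform, F = indicators of {z <= t}.
support = list(range(5))
P = {z: 0.2 for z in support}
F = [lambda z, t=t: 1.0 if z <= t else 0.0 for t in support]

if __name__ == "__main__":
    rng = random.Random(0)
    for n in (10, 100, 1000):
        s = [rng.choice(support) for _ in range(n)]
        print(n, block_distortion(s, P, F), single_letter_average(s, P, F))
```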